iT邦幫忙

2023 iThome 鐵人賽

DAY 28

Self-Challenge Group (自我挑戰組)

Learning Deep Learning & ASR Chinese Speech Recognition series, Day 28

【Day 28】Happy Fine-tuning Time with the Whisper Model - 3

Did you think I was going to explain how to convert your own data into the format Datasets expects? Nope. Preparing data is such a hassle,
so I'll just keep going.
Here we can use a variable to hold which base model we want to fine-tune; I'm going with the Whisper small version:
selected_model = "openai/whisper-small"
Some people might ask: why not use medium?
Because not everyone's GPU is that powerful. I'll show what happens if you use medium in the next post.
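
If you're not sure what your GPU can handle, here's a minimal sketch (my own addition; the rule of thumb in the comment is a rough assumption, not an official requirement) for checking available VRAM before picking a size:

import torch

# check whether a CUDA GPU is available and how much memory it has
if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"{torch.cuda.get_device_name(0)}: {vram_gb:.1f} GB VRAM")
    # rough assumption: small is workable around 8 GB for fine-tuning,
    # medium wants noticeably more
else:
    print("No CUDA GPU found; fine-tuning on CPU will be very slow")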

Import

Next up is feature extraction, which uses the FeatureExtractor.
We can reuse the selected_model variable from above; if you want a different model, change it there.

from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained(selected_model)
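
If you're curious what the feature extractor actually produces, here's a small sketch with dummy audio (my own example; the shapes are those used by whisper-small):

import numpy as np

# one second of silence at 16 kHz, just to inspect the output
dummy_audio = np.zeros(16000, dtype=np.float32)
features = feature_extractor(dummy_audio, sampling_rate=16000).input_features[0]
print(features.shape)  # (80, 3000): 80 log-Mel bins x 3000 frames (padded to 30 s)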

Next is the Tokenizer, again loaded from the pre-trained checkpoint:

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(selected_model, language="chinese", task="transcribe")
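
A quick way to sanity-check the tokenizer is an encode/decode round trip (a minimal sketch; the sample sentence is just a placeholder):

input_str = "今天天氣很好"  # any Chinese sentence works here
labels = tokenizer(input_str).input_ids
decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)
decoded_str = tokenizer.decode(labels, skip_special_tokens=True)

print(f"Input:                 {input_str}")
print(f"Decoded w/ special:    {decoded_with_special}")
print(f"Decoded w/out special: {decoded_str}")
print(f"Round trip OK:         {input_str == decoded_str}")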

Here we add one more piece, the Processor:

from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(selected_model, language="chinese", task="transcribe")
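
The Processor doesn't add anything new; it just bundles the two objects above so you only need to pass one thing around. A quick check (a minimal sketch):

# the processor exposes the same two components we built above
print(type(processor.feature_extractor).__name__)  # WhisperFeatureExtractor
print(type(processor.tokenizer).__name__)          # WhisperTokenizer (or the fast variant)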

Prepare Data

Then convert the sampling rate of all audio files to 16 kHz, which is what Whisper expects:

from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

print(common_voice["train"][0])

The printed sampling_rate field should now be 16000.

Next, a bit of light preprocessing on the Dataset:

def prepare_dataset(batch):
    # load and resample audio data from 48 kHz to 16 kHz
    audio = batch["audio"]

    # compute log-Mel input features from the input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1)

One thing to watch here: the original guide used num_proc=4. If that run errors out, or your machine isn't that powerful, dropping it to 1 may work better.
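
Mapping over the whole dataset also takes a while, so it can be worth testing the pipeline on a handful of examples first. A minimal sketch to run instead of the full map while debugging:

# try the preprocessing on just a few samples to catch errors early
sample = common_voice["train"].select(range(4)).map(
    prepare_dataset, remove_columns=common_voice.column_names["train"]
)
print(len(sample[0]["input_features"]))  # 80 log-Mel bins
print(sample[0]["labels"][:10])          # first few label ids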

Data Collator

The collator handles batching: audio inputs and text labels have different lengths and need different padding methods, so it pads them separately and masks the label padding with -100 so it's ignored by the loss.

import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
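
To see the collator in action, here's a small usage sketch (my own addition, assuming common_voice has already gone through prepare_dataset above):

# batch two processed examples and inspect the padded tensors
example_batch = data_collator([common_voice["train"][0], common_voice["train"][1]])
print(example_batch["input_features"].shape)  # (2, 80, 3000)
print(example_batch["labels"].shape)          # (2, longest label length), padded with -100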

That's it for now!


Quick thoughts

Coming up to Taipei is exhausting. Three days left!

